-
- draft X.400 use of extended character sets Apr 92
-
-
- X.400 use of extended character sets
-
- Fri Nov 6 15:13:56 MET 1992
-
-
- Harald Tveit Alvestrand
- SINTEF DELAB
- Harald.Alvestrand@delab.sintef.no
-
-
-
-
-
-
- Status of this Memo
-
- This draft document is being circulated for comment.
-
- If consensus is reached it may be submitted to the RFC editor as a
- Proposed Standard protocol specification, for use in X.400 in the
- Internet.
-
- Please send comments to the author, or to the RARE WG-MSG list
- <wg-msg@rare.nl>.
-
- The following text is required by the Internet-draft rules:
-
- This document is an Internet Draft. Internet Drafts are working
- documents of the Internet Engineering Task Force (IETF), its
- Areas, and its Working Groups. Note that other groups may also
- distribute working documents as Internet Drafts.
-
- Internet Drafts are draft documents valid for a maximum of six
- months. Internet Drafts may be updated, replaced, or obsoleted by
- other documents at any time. It is not appropriate to use
- Internet Drafts as reference material or to cite them other than
- as a "working draft" or "work in progress."
-
- Please check the I-D abstract listing contained in each Internet
- Draft directory to learn the current status of this or any other
- Internet Draft.
-
-
-
-
-
-
-
-
- Alvestrand Expires May 6 93 [Page 1]
-
-
- 1. Introduction
-
- Since 1988, X.400 has had the capacity for carrying a large number
- of different character sets in a message by using the body part
- "GeneralText" defined by ISO/IEC 10021-7.
-
- Since 1992, the Internet also has the means of passing around
- messages containing multiple character sets, by using the
- mechanism defined in RFC-MIME.
-
- This document defines a suggested method of using "GeneralText" in
- order to harmonize as much as possible the usage of this body
- part.
-
-
- 2. General principles
-
- 2.1. Goals
-
- The target of this memo is to define a way of using existing
- standards to achieve:
-
-
- (1) in the short term, a standard for sending E-mail in the
- European languages (Latin letters with European accents,
- Greek and Cyrillic)
-
- (2) in the medium term, extending this to cover the Hebrew and
- Arabic character sets
-
- (3) in the long term, opening up true international E-mail by
- allowing the full character set specified in ISO-10646 to be
- used.
-
-
- The author believes that this document gives a specification that
- can easily accommodate the use of any character set in the ISO
- registry, and, by giving guidance rules for choosing character
- sets, will help interworking.
-
- 2.2. Families of character sets
-
-
- 2.2.1. ISO 6937/T.61
-
- ISO 6937 is a code technique used and recommended in T.51 and
- T.101 (Teletex and Videotex service) and in X.500, providing a
- repertoire of 333 characters from the Latin script by use of non-
- spacing diacritical marks. It corresponds closely to CCITT
- recommendation T.61.
-
- The problem with that technique is that the character stream comes
- in two modes, i.e., some characters are coded with one byte and some
- with two (composite characters). This makes information processing
- systems such as an E-mail UA or GW more complex.
-
- It is also not extensible to other languages like Korean or
- Chinese, or even Greek, without invoking the character set
- switching techniques of ISO 2022.
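-
- As a rough illustration of the two-mode problem, the following
- Python sketch decodes a byte string in which a non-spacing
- diacritical mark precedes its base letter. The function name and
- the small table of diacritic code positions are assumptions made
- for this example; only a few marks are handled, and every other
- byte is treated as ASCII.
-

```python
import unicodedata

# Assumed subset of the ISO 6937 non-spacing diacritics, mapped to the
# corresponding Unicode combining marks (illustrative, not complete).
NON_SPACING = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex accent
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
}


def decode_composites(data: bytes) -> str:
    """Decode one-byte characters and two-byte composite characters."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in NON_SPACING:
            # Two-byte mode: the diacritic comes first, the base letter second.
            if i + 1 >= len(data):
                raise ValueError("diacritical mark without a base letter")
            base = chr(data[i + 1])  # base letter assumed to be ASCII
            out.append(unicodedata.normalize("NFC", base + NON_SPACING[b]))
            i += 2
        else:
            # One-byte mode: anything outside ASCII is rejected in this sketch.
            out.append(bytes([b]).decode("ascii"))
            i += 1
    return "".join(out)
```

- The decoder cannot advance by a fixed number of bytes per
- character, which is the added complexity described above.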
-
-
- 2.2.2. ISO 8859
-
- ISO 8859 defines a set of character sets, each suitable for use in
- some group of languages. Each character in ISO 8859 is coded in a
- single byte.
-
- There are currently 9 parts of ISO 8859, plus a "supplementary"
- set, registered as ISO IR 154. All languages using single-byte
- characters can be written in one or another of the ISO 8859 sets.
- There are sets covering Greek, Hebrew and Arabic.
-
- All the ISO 8859 sets include US ASCII as a subset. All use 8
- bits.
-
- ISO 8859 is regarded by many as a solution; for instance, the X
- windows system now comes with ISO-8859-1 as the "standard"
- character set, with the possibility of specifying others. But
- since the same applications often do not support character set
- switching within text, it is problematic to use these in a truly
- multilingual environment. (Also, most fonts claiming to be "ISO-
- 8859-1" in X11R5 are actually 7-bit fonts. The implied lie is very
- unfortunate.)
-
- It turns out to work fine, however, if the second language is
- English, since this can be written in all ISO 8859 sets.
-
- Parts 3 and 4 of ISO 8859 have not seen wide acceptance, and it is
- expected that they will be discarded. They should therefore not be
- used.
-
-
- 2.2.3. ISO 10646
-
- At the moment of writing, ISO 10646 has just been accepted as an
- International Standard. It is basically a 32-bit character set,
- with all of the currently used characters being numbered by the
- first 16 bits, leaving some room for expansion.
-
- It is not possible to use ISO 10646 as a normal character set,
- because it does not conform to the rules for usage of byte values
- set down in ISO 2022 and other places; it uses the "control space"
- for (parts of) graphic character codes.
-
- There are a number of ways to encode ISO 10646 characters "on the
- wire". There are methods within the ISO 2022 standard to switch to
- these, either as "other coding system without return" or as "other
- coding system with return" (that is, you can go back from it to
- the one you came from using an ISO 2022 escape sequence).
-
- The following registrations have been made:
-
-
- ISO 10646 UCS-2 Level 1 has been registered with ESC 2/5 2/15 4/0.
- ISO 10646 UCS-4 Level 1 has been registered with ESC 2/5 2/15 4/1.
-
- The following are applied for:
-
- Reg#  Escape sequence  Standard/Sponsor  Description
- 174   ESC 2/5 2/15 F   ISO/IEC 10646     UCS-2, Level 2
- 175   ESC 2/5 2/15 F   ISO/IEC 10646     UCS-4, Level 2
- 176   ESC 2/5 2/15 F   ISO/IEC 10646     UCS-2, Level 3
- 177   ESC 2/5 2/15 F   ISO/IEC 10646     UCS-4, Level 3
- 178   ESC 2/5 F        ISO/IEC 10646     UTF-1
-
- << NOTE: The registration numbers for UCS-2 level 1 and UCS-4
- level 1 are not known. Neither are the assigned final characters
- for the other sets. Information requested!>>
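-
- The column/row notation used for escape sequences in this memo
- (for example 2/15) maps directly to byte values. The following
- Python sketch shows the conversion for the two registered
- sequences; the helper names are assumptions made for this
- example.
-

```python
ESC = 0x1B  # the ESCAPE control character, code position 1/11


def code_position(pos: str) -> int:
    """Convert column/row notation such as '2/15' to a byte value."""
    column, row = (int(part) for part in pos.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError(f"invalid code position: {pos}")
    return column * 16 + row


def escape_sequence(*positions: str) -> bytes:
    """Build the bytes of ESC followed by the given code positions."""
    return bytes([ESC] + [code_position(p) for p in positions])


# The two registered sequences quoted in the text.
UCS2_LEVEL1 = escape_sequence("2/5", "2/15", "4/0")
UCS4_LEVEL1 = escape_sequence("2/5", "2/15", "4/1")
```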
-
- This character set will become very important in the future, but
- at the moment, few systems are able to support this directly.
-
- The GeneralText body part can be used for carrying any of these
- character sets.
-
-
- 2.3. Body parts that can be used in X.400
-
- At the moment, no established way of transferring a full set of
- characters in X.400-based E-mail exists. In the future, it is
- likely that a new body part, based on ISO 10646, will be
- available; it is, however, dangerous to try to specify this body
- part before ISO 10646 is final.
-
- In the short term, the deployed and available body parts are:
-
-
- (1) IA5Text
-
- (2) For X.400/84: ISO6937Text and Teletex
-
- (3) For X.400/88: GeneralText
-
- IA5Text is the method of choice for E-mail that contains only
- characters from IA5 (equivalent to ASCII).
-
- The ISO6937Text body part is defined in the ISO DIS documents
- corresponding to X.400(84) [MOTIS-86]; these never became a
- standard, so they are now quite difficult to find. It is in
- principle limited to using text that can be presented in ISO 6937,
- but since ISO 6937 refers to the ISO 2022 method of changing
- character sets, it is theoretically possible to use any ISO
- registered character set, but there is no facility for announcing
- the character sets used. This makes interworking with equipment
- that does not support the same character sets complex.
-
- It is still, however, the only body part suitable for carrying
- non-paginated text in non-basic character sets in X.400(84).
-
- Teletex, which is identical in all versions of the X.400 standard,
- has the same problem of implicit ISO6937, but has the added
- problem that it also specifies a page format, with, for instance,
- a left margin of 5 character positions. This is often not
- desirable.
-
- The details of Teletex are specified in recommendation T.51 and
- its relatives.
-
- GeneralText is defined in ISO/IEC 10021-7, the part of [MOTIS] that
- corresponds to CCITT recommendation [X.420]. It is an Extended
- body part, so no modification to CCITT implementations is needed
- to carry it.
-
- GeneralText is suitable for interchange, since it has proper
- announcement facilities. It can use any number of character sets,
- and announces them both in the Encoded Information Types of the
- X.400 envelope and the parameters of the body part.
-
- We recommend this body part for carrying unformatted text in
- X.400/88.
-
-
- 3. GUIDELINES FOR THE GENERATION OF GENERALTEXT
-
-
- 3.1. Formal definition of GeneralText
-
- A GeneralText message is a byte stream that contains characters
- and character switching sequences according to [ISO 2022].
-
- The X.400 ASN.1 definition of the GeneralText body part is:
-
-
- general-text-body-part EXTENDED-BODY-PART-TYPE
-     PARAMETERS GeneralTextParameters IDENTIFIED BY id-ep-general-text
-     DATA GeneralTextData
-     ::= id-et-general-text
-
- GeneralTextParameters ::= SET OF CharacterSetRegistration
-
- CharacterSetRegistration ::= INTEGER (1..32767)
-
- GeneralTextData ::= GeneralString
-
-
- The definition is from ISO/IEC 10021-7 [MOTIS], Annex I, with
- modifications made in the MHS Implementors' Guide, version 8,
- chapter 3.6.3, bullet F130. It does not appear in the CCITT
- version of the standards.
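-
- To make the shape of the definition concrete, the following
- Python sketch models the two components of the body part and
- enforces the INTEGER (1..32767) constraint on registration
- numbers. The class and field names are assumptions made for this
- example; no ASN.1 encoding is performed.
-

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralTextBodyPart:
    # GeneralTextParameters: SET OF CharacterSetRegistration
    parameters: frozenset
    # GeneralTextData: the octets of the GeneralString
    data: bytes

    def __post_init__(self):
        for registration in self.parameters:
            # CharacterSetRegistration ::= INTEGER (1..32767)
            if not isinstance(registration, int) or not 1 <= registration <= 32767:
                raise ValueError(
                    f"not a valid registration number: {registration!r}"
                )
```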
-
-
- 3.2. Brief description of ISO 2022 character set switching
-
- There are 4 graphic character sets active at any time in a
- GeneralText message, called G0, G1, G2 and G3. In addition, there
- are 2 control character sets, called C0 and C1.
-
- At any moment, one of the sets G0-G3 is active in code positions
- 2/1 to 7/14, and another is active in code positions 10/0 to
- 15/15. The setting is achieved by so-called "locking shift"
- sequences.
-
- (Formally, code positions 2/0 and 7/15 are reserved for "space"
- and "DEL" respectively, and only 94-character character sets can
- be used in G0. In practice, this restriction is sometimes ignored.)
-
- Single characters from the non-active sets may be invoked by the
- use of "single shift" sequences.
-
- The control character sets always occupy the code positions 0/0 to
- 1/15 (C0) and 8/0 to 9/15 (C1).
-
- The character sets currently active as G0-G3 and C0-C1 may be
- changed using "character set designating sequences".
-
- At the beginning of a GeneralText message, one must always assume
- that set 2 (IA5) is active as G0, shifted into the lower half,
- that set 1 (standard) is active as C0, and that no G1-G3 or C1 set
- is invoked. This is specified in the definition of "GeneralString"
- in [X.209], the definition of ASN.1 encoding (section 23.5.2).
-
- If this is not a suitable initial state, a message must always
- start with the necessary announcers and escape sequences to
- designate and invoke the character sets that are actually used.
- The character sets in use may be changed later in the message by
- use of escape sequences.
-
- The parameters of a GeneralText message always list all the
- character sets used, by quoting their ISO reference numbers.
-
- It is impossible to use a character set not registered with ISO in
- a GeneralText message.
-
- It is also impossible to decide on the true meaning of a byte in a
- GeneralText message without scanning the whole message for shift
- and escape sequences.
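-
- The dependence of each byte on the preceding sequences can be
- sketched as a small scanner that labels every graphic byte with
- the designation in force when it occurs. This is a simplified
- illustration, not a full ISO 2022 parser: it handles only
- single-byte set designations and locking shifts, ignores single
- shifts and multi-byte sets, and the function name and label
- format (intermediate plus final character) are assumptions. The
- initial label "(@" assumes that set 2 carries the final byte 4/0.
-

```python
ESC = 0x1B

# Intermediate byte after ESC -> index of the G-set being designated.
DESIGNATE = {
    0x28: 0, 0x29: 1, 0x2A: 2, 0x2B: 3,  # 94-character sets
    0x2D: 1, 0x2E: 2, 0x2F: 3,           # 96-character sets
}

# Byte after ESC -> (half of the code table, G-set index) for locking shifts.
LOCKING = {
    0x6E: ("GL", 2), 0x6F: ("GL", 3),                    # LS2, LS3
    0x7E: ("GR", 1), 0x7D: ("GR", 2), 0x7C: ("GR", 3),   # LS1R, LS2R, LS3R
}


def label_bytes(data: bytes):
    """Return (byte, designation) pairs for the graphic bytes in data."""
    g = ["(@", None, None, None]  # assumed initial state: set 2 as G0
    gl, gr = 0, None              # G0 in the lower half, nothing in the upper
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            nxt = data[i + 1] if i + 1 < len(data) else None
            if nxt in DESIGNATE and i + 2 < len(data):
                g[DESIGNATE[nxt]] = chr(nxt) + chr(data[i + 2])
                i += 3
                continue
            if nxt in LOCKING:
                half, index = LOCKING[nxt]
                if half == "GL":
                    gl = index
                else:
                    gr = index
                i += 2
                continue
            raise ValueError(f"unsupported escape sequence at offset {i}")
        if b == 0x0E:    # SO, locking shift 1
            gl = 1
        elif b == 0x0F:  # SI, locking shift 0
            gl = 0
        elif 0x21 <= b <= 0x7E:
            out.append((b, g[gl]))
        elif 0xA0 <= b <= 0xFF:
            out.append((b, g[gr] if gr is not None else None))
        # Space, DEL and other control characters are skipped in this sketch.
        i += 1
    return out
```

- With this sketch, the byte F8 is labelled "-A" after the
- sequences for ISO 8859-1 and "-F" after those for ISO 8859-7,
- although the byte value itself is unchanged.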
-
-
- 3.3. How to use the character sets
-
- RECOMMENDATION:
-
- When the text to be rendered is representable in one of the
- character sets of ISO-8859, the G0 set should be set to ISO 646
- International Reference Version (1991), also called US-ASCII,
- ISO-IR-6.
-
- The older character set ISO-IR-2, ISO 646 IRV(1983), should NOT be
- used. This means that the escape sequence ESC 2/8 4/2
- (designating ASCII as G0) should always occur at the beginning of
- the message.
-
- The G1 set should be set to the relevant ISO-8859 part. G2 and G3
- are not used.
-
- This corresponds to the first level of ISO 4873 usage.
-
- For the currently defined parts of ISO 8859, the character set
- designations are (relative to ISO 8859:1987):
-
- Part  ISO IR name  Escape sequence  Remarks
-                    for G1 use
-
- 1     ISO-IR-100   Esc 2D 41        West Europe
- 2     ISO-IR-101   Esc 2D 42        East European
- 3     ISO-IR-109   Esc 2D 43
- 4     ISO-IR-110   Esc 2D 44
- 5     ISO-IR-144   Esc 2D 4C        Cyrillic
- 6     ISO-IR-127   Esc 2D 47        Arabic
- 7     ISO-IR-126   Esc 2D 46        Greek
- 8     ISO-IR-138   Esc 2D 48        Hebrew
- 9     ISO-IR-148   Esc 2D 4D        Baltic, Turkish
-
- NOTE: The use of ISO 8859-3 and ISO 8859-4 is NOT recommended if
- other possibilities exist.
-
- The G1 set should be permanently shifted into the upper half of
- the code page.
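-
- As a sketch of the recommendation for ISO 8859 text, the
- following Python function produces the body part parameters and
- data for text that fits a single ISO 8859 part. The function and
- table names are assumptions made for this example, as is the use
- of Python's codecs, whose tables may differ in detail from ISO
- 8859:1987. Parts 3 and 4 are left out, following the note above.
- The locking shift LS1R is taken to be ESC 7/14.
-

```python
# ISO 8859 part -> (ISO registration number, G1 designation, Python codec)
ISO_8859_PARTS = {
    1: (100, b"\x1b\x2d\x41", "iso8859_1"),
    2: (101, b"\x1b\x2d\x42", "iso8859_2"),
    5: (144, b"\x1b\x2d\x4c", "iso8859_5"),
    6: (127, b"\x1b\x2d\x47", "iso8859_6"),
    7: (126, b"\x1b\x2d\x46", "iso8859_7"),
    8: (138, b"\x1b\x2d\x48", "iso8859_8"),
    9: (148, b"\x1b\x2d\x4d", "iso8859_9"),
}

ASCII_AS_G0 = b"\x1b\x28\x42"  # ESC 2/8 4/2, designates ISO-IR-6 as G0
LS1R = b"\x1b\x7e"             # ESC 7/14, shifts G1 into the upper half


def make_general_text(text: str, part: int):
    """Return (registration numbers, data bytes) for text in one 8859 part."""
    registration, designate_g1, codec = ISO_8859_PARTS[part]
    # Raises UnicodeEncodeError if the text does not fit the chosen part.
    body = text.encode(codec)
    if all(byte < 0x80 for byte in body):
        # Only ASCII is present, so no G1 set needs to be announced.
        return {6}, ASCII_AS_G0 + body
    return {6, registration}, ASCII_AS_G0 + designate_g1 + LS1R + body
```

- Because the G1 set stays in the upper half for the whole message,
- no further shift sequences are needed after the prefix.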
-
- When the text is not representable in one of the ISO-8859
- character sets, the following rules may be applied:
-
-
- (1) If any Latin characters are used, keep IA5 as the G0 set.
-
- (2) If a mainstream character set is used (Greek, Cyrillic,
- Hebrew, Arabic), designate this as the G1 character set, and
- permanently shift this into the upper half of the code page
- (LS1R).
- EXCEPTION: The Japanese community has a long tradition of
- switching between the Japanese 16-bit character set ISO-IR-87
- and USASCII as the G0 set. See [RFC-2022-JP] for details. If
- ISO-IR-87 is used, that technique should be used instead of
- the one recommended here.
-
- (3) If occasional extensions to a character set that is basically
- Latin occur (like accents, national variants and so on), and
- these are available in a single character set, designate the
- relevant set as G2 and use single shift (SS2) to invoke
- characters from this character set.
-
- The ISO 8859 supplementary set, ISO-IR-154, is recommended
- for this purpose.
-
- This corresponds to the ISO 4873 "second level" application.
-
- (4) If two non-Latin character sets are used, the second should
- be designated as G3, and shifted into the upper half of the
- code page by the use of Locking Shift 3 Right (LS3R).
-
- This corresponds to the ISO 4873 "third level" application.
-
-
- (5) Character sets with floating accents, like ISO 6937, should
- be avoided where possible.
-
- (6) The shifts changing the lower half of the code table (SI/SO,
- LS2 and LS3) should NOT be used.
-
- RATIONALE: Keeping the G0 set reserved for ASCII will ensure that
- text in ASCII always has the same bit representation.
-
- The use of the upper code page for other scripts ensures that both
- text in these languages and text of this type mixed with English
- can be represented without the use of shift sequences.
-
- If the language and/or content of a text is completely unknown,
- chapter 5 gives an algorithm that may be used to decide upon the
- character sets. This might, for instance, be suitable for use at
- automatic mail gateways.
-
- NOTE: At the time of this writing, few applications that use ISO
- 4873 level 2 and level 3 encoding exist. It has been estimated
- that implementing them in an application that already uses a rich
- repertoire of characters is a matter of programmer-days, not
- programmer-months, but this has not been proven.
-
-
- 4. GUIDELINES FOR THE RENDERING OF GENERALTEXT
-
- As a basic rule, one should NOT assume that any of the rules above
- are followed.
-
- A user agent capable of rendering GeneralText should:
-
-
- (1) ALWAYS be able to identify and render characters in IA5, no
- matter how they are designated and invoked.
-
- (2) ALWAYS be able to identify and render characters in the
- "native" character sets, no matter how they are designated
- and invoked.
-
- (3) ALWAYS indicate the presence of characters that cannot be
- adequately represented on the current output device.
-
- (4) NEVER render a character in an unknown or unrepresentable
- character set by displaying the character in the same bit
- position in the native character set.
-
- (5) PREFERABLY be able to identify and render characters that are
- the same as characters in the "native" character sets, even
- though they are designated and invoked as part of other
- character sets. This applies in particular to the
- "invariant" part of ISO 8859, parts 1 through 6.
-
- (6) PREFERABLY be able to combine the floating accents of ISO
- 6937 with their base characters for suitable rendering using
- the capabilities of the current output device.
-
- (7) PREFERABLY be able to display text both in a mode using
- fallbacks for nonrenderable characters and in a mode
- designating nonrenderable characters as such.
-
- (8) PREFERABLY be able to save the content of a GeneralText
- message to a file or other suitable media, saving all
- character set information, for later processing by other
- means. It is not illegal to render the character set
- information into a different format; however, it should be
- noted that it is easy to lose vital information if the format
- chosen for representing character sets does not offer the
- possibility of referencing all character sets in the ISO
- registry of character sets.
-
- These requirements also apply to gateways that transform the
- message into some other format, for example a gateway that
- transforms a message into MIME using [RFC-2022-JP] for the
- purpose.
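-
- Rules (3) and (4) can be sketched as follows in Python. The input
- is a list of (byte, designation) pairs, where the designation is
- a label such as "-A" formed from the intermediate and final
- characters of the designating escape sequence. The function name,
- the label format and the small table of known sets are
- assumptions made for this example.
-

```python
# Designation label -> Python codec for the sets this renderer knows.
KNOWN_SETS = {
    "(B": "ascii",
    "-A": "iso8859_1",
    "-F": "iso8859_7",
}

REPLACEMENT = "\ufffd"  # visible marker for characters that cannot be shown


def render(labelled) -> str:
    """Render (byte, designation) pairs, marking what cannot be rendered."""
    out = []
    for byte, designation in labelled:
        codec = KNOWN_SETS.get(designation)
        if codec is None:
            # Rule (4): never fall back to the same bit position in the
            # native set. Rule (3): indicate the character instead.
            out.append(REPLACEMENT)
            continue
        out.append(bytes([byte]).decode(codec, errors="replace"))
    return "".join(out)
```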
-
-
- 5. RECOMMENDATION FOR SELECTION OF CHARACTER SETS
-
-
- 5.1. Algorithm for selection of character sets
-
- When one has text in which characters from several character sets
- occur, and wants to process this into a GeneralText document, it
- is often hard to guess which character sets to select.
-
- The following paragraphs give an algorithm that can be started at
- the beginning of a message and that, at the end of it, returns a
- set of character sets that can be used as G0..G3 character sets, OR
- an indication that the task is impossible.
-
- VARIABLES:
-
- UsedSets
- The set of character sets that MUST be used for this message
-
- UsableSets
- The set of character sets that MAY be used for this message.
- Each set also contains a counter for each character position.
-
- PossibleSets
- The set of all the character sets known to be usable in the
- destination format.
-
- ALGORITHM:
-
- 1) Add IA5 (ISO-IR-6) to the UsedSets (as G0).
-
- 2) Get the next character of the text. If the text is
- completely analyzed, go to FINISHED
-
- 3) If it is in the UsedSets, go to 2).
-
- 4) Find the set of character sets from PossibleSets in which the
- character occurs. If it does not occur in any, report
- failure.
-
- 5) If it is in a single character set in PossibleSets only, add
- this set to UsedSets, and go to 2).
-
- 6) If it is in more than one character set, add these to
- UsableSets (if not already present), and increment the
- counter for that character in all the sets. Go to 2).
-
- FINISHED)
-
- 1) (FINAL SELECTION) Remove any character set in UsedSets from
- UsableSets.
-
- Zero the counters for any character in UsableSets that also
- occurs in UsedSets.
-
- WHILE (characters with non-zero counters remain)
-   Select one character set and move it from UsableSets to
-   UsedSets.
-   Zero the counters for all characters in this set in the
-   other UsableSets.
- END WHILE
-
- This step can be "tuned" any way you want, for instance by
- choosing the character sets most likely to be understood at
- the destination first, choosing the character sets covering
- the most characters first, avoiding multi-byte character sets
- as long as possible, or any other scheme suitable for the
- application.
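-
- The algorithm can be sketched in Python as follows. This is one
- possible reading, not a reference implementation: the function
- name is an assumption, each character set is represented simply
- as a set of characters, sets that may be used but are not yet
- required are tracked with counters, and the final selection is
- tuned to pick the set covering the most outstanding characters
- first. The limit of four sets (G0..G3) is not checked.
-

```python
from collections import Counter


def select_character_sets(text, possible_sets, base="ISO-IR-6"):
    """Return the list of sets to use, or None if the task is impossible.

    possible_sets maps a set name to the characters it contains and
    must include the base set.
    """
    used = [base]  # step 1: IA5 is always used (as G0)
    usable = {}    # set name -> Counter of characters it could cover

    def covered(ch):
        return any(ch in possible_sets[name] for name in used)

    for ch in text:                                # step 2
        if covered(ch):                            # step 3
            continue
        candidates = [name for name, repertoire in possible_sets.items()
                      if ch in repertoire]         # step 4
        if not candidates:
            return None                            # report failure
        if len(candidates) == 1:                   # step 5
            used.append(candidates[0])
        else:                                      # step 6
            for name in candidates:
                usable.setdefault(name, Counter())[ch] += 1

    # FINAL SELECTION: drop sets already in use and characters they cover.
    for name in used:
        usable.pop(name, None)
    for counter in usable.values():
        for ch in list(counter):
            if covered(ch):
                del counter[ch]

    while any(usable.values()):
        # Tuning choice: take the set covering the most occurrences first.
        best = max(usable, key=lambda name: sum(usable[name].values()))
        used.append(best)
        del usable[best]
        for counter in usable.values():
            for ch in list(counter):
                if ch in possible_sets[best]:
                    del counter[ch]
    return used
```

- A character that occurs in several sets does not force a choice
- until the whole text has been seen, so a later character that
- occurs in only one of those sets can settle the question.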
-
- 5.2. WHAT TO DO ON FAILURE
-
- Failure will occur in this schema if a character is found that is
- not in the PossibleSets. It may then be handled in one of the
- following ways:
-
- (1) Replace the character with the SUB control character
-
- (2) Replace the character with Keld Simonsen Mnemonics. This is a
- reversible transformation as long as the recipient is aware
- that it has been used, but requires passing out-of-band
- information to indicate this.
-
- (3) Replace the lost characters with any suitable fallback or
- mnemonic scheme intended for human understanding
-
- (4) Bounce the message/refuse the conversion/give up.
-
- The action to be taken may be different based on the percentage of
- "lost" characters.
-
- If the message has "controls" like "conversion with loss
- prohibited", only the last possibility may be used.
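-
- The first and last options, together with the percentage rule,
- can be sketched in Python as follows. The function name, the
- parameter names and the default threshold of 5 % are assumptions
- made for this example; the memo does not prescribe a threshold.
-

```python
SUB = "\x1a"  # the SUB control character, code position 1/10


def substitute_lost(text, repertoire, max_lost_fraction=0.05,
                    loss_prohibited=False):
    """Replace characters outside repertoire with SUB, or refuse."""
    lost = sum(1 for ch in text if ch not in repertoire)
    if lost and (loss_prohibited or lost / len(text) > max_lost_fraction):
        # Option (4): bounce the message or refuse the conversion.
        raise ValueError(f"{lost} of {len(text)} characters would be lost")
    # Option (1): replace each lost character with SUB.
    return "".join(ch if ch in repertoire else SUB for ch in text)
```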
-
-
- 5.3. RECOMMENDED CHARACTER SETS
-
- There are 2 steps in the algorithm above that are left for local
- judgement:
-
- (1) Selection of the sets to appear in PossibleSets.
-
- (2) The algorithm for deciding which character set to select in
- the FINAL SELECTION step.
-
- In the context of generating X.400 GeneralText messages, the
- following is recommended:
-
- Sets in PossibleSets:
- ISO-IR-6    Esc 28 42 (G0)  US-ASCII, IA5, ISO646
- ISO-IR-100  Esc 2D 41 (G1)  ISO-8859-1 West Europe
- ISO-IR-101  Esc 2D 42 (G1)  ISO-8859-2 Central/Eastern Europe
- ISO-IR-144  Esc 2D 4C (G1)  ISO-8859-5 Cyrillic
- ISO-IR-127  Esc 2D 47 (G1)  ISO-8859-6 Arabic
- ISO-IR-126  Esc 2D 46 (G1)  ISO-8859-7 Greek
- ISO-IR-138  Esc 2D 48 (G1)  ISO-8859-8 Hebrew
- ISO-IR-148  Esc 2D 4D (G1)  ISO-8859-9 Baltic/Nordic/Turkish
-
- The following multi-byte character sets are recommended:
-
- ISO-IR-87   (Japanese JIS C6226-1983)  Esc 24 29 42 (G1)
- ISO-IR-149  (Korean KS C 5601-1989)    Esc 24 29 43 (G1)
- ISO-IR-58   (Chinese GB 2312-80)       Esc 24 29 41 (G1)
-
- It is a STRONG recommendation that character sets not listed
- above, which do not add any new characters to the total set of
- characters given by the character sets above, should NOT be used
- in X.400 interchange.
-
- ISO-IR-87 is the Japanese character set that is allowed in a
- Teletex string, such as the subject field.
-
- NOTE: ISO-IR-87 has been "superseded" by ISO-IR-168, which allows
- two extra Kanji characters. Any application that handles ISO-IR-87
- should also be able to handle ISO-IR-168.
-
- Algorithm for selecting character sets:
-
- Start at the top of the list above, and add each set only if it is
- needed.
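-
- This first-fit rule can be sketched in Python as follows. The
- function names are assumptions made for this example, Python's
- codecs stand in for the registered character sets (their tables
- may differ in detail from the registered versions), and the
- multi-byte sets are left out for brevity.
-

```python
# The single-byte sets from the list above, in the same order,
# paired with the Python codec used here to approximate each one.
RECOMMENDED = [
    ("ISO-IR-6", "ascii"),
    ("ISO-IR-100", "iso8859_1"),
    ("ISO-IR-101", "iso8859_2"),
    ("ISO-IR-144", "iso8859_5"),
    ("ISO-IR-127", "iso8859_6"),
    ("ISO-IR-126", "iso8859_7"),
    ("ISO-IR-138", "iso8859_8"),
    ("ISO-IR-148", "iso8859_9"),
]


def representable(ch: str, codec: str) -> bool:
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def pick_sets(text: str):
    """Return the names of the sets needed, or None if none fits."""
    chosen = [RECOMMENDED[0]]  # ISO-IR-6 is always present as G0
    for ch in text:
        if any(representable(ch, codec) for _, codec in chosen):
            continue
        for entry in RECOMMENDED:
            # Start at the top of the list; add a set only when needed.
            if representable(ch, entry[1]):
                chosen.append(entry)
                break
        else:
            return None
    return [name for name, _ in chosen]
```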
-
-
- 5.4. Selecting a character set based on language
-
- If the most common language of the environment in which it is used
- is known, the following character sets are recommended.
-
- The table of Latin-script languages is based on work by Johan van
- Wingen <BUTPAA@rulmvs.leidenuniv.nl>. The others are best
- guesses by the author.
-
- The tables of character sets prepared by Keld Jorn Simonsen
- <keld@dkuug.dk> (RFC-KELD) were invaluable in matching the data on
- languages to the data on character sets.
-
- Again, these are intended for guidance, not enforcement; there is
- considerable prestige attached to such recommendations in other
- contexts, and it is therefore likely that each language group will
- make appropriate decisions on this subject. The table below is
- intended as a compilation of existing knowledge, again on the
- principle that it is better to say something than to say nothing.
-
- The language codes (for those languages that have codes) come from
- ISO 639.
-
- NOTE: ISO 639 is a very incomplete list of the world's languages
- (perhaps 10 or 20 % according to some experts), and is undergoing
- revision. The only reason for using it is that it is the only
- ISO-standardized shorthand notation for languages available at the
- moment.
-
- Language 1 2 3 4 5
- ------------------------------------------------------------
- sq Albanian X X X
- eu Basque X X
- br Breton X
- hr Croatian X
- cs Czech X
- da Danish X
- eo Esperanto X
- fo Faeroese X
- fi Finnish X X X
- fy Frisian X
- ?? Gaelic X
- gl Galician X X
- de German X X
- hu Hungarian X
- is Icelandic X
- ga Irish X X X
- it Italian X
- no Norwegian X X
- pl Polish X
- pt Portuguese X
- ?? Rhaetian X
- ro Romanian X
- sk Slovak X
- sl Slovenian X
- ?? Sorbian X
- es Spanish X X
- sv Swedish X X
- tr Turkish X
-
- Explanation of character set codes
- ----------------------------------------
- 1: ISO_8859-1:1987
- 2: ISO_8859-2:1987
- 3: ISO_8859-9:1989
- 4: ISO_8859-supp
- 5: ISO_8859-2:1987 and ISO_8859-supp
-
-
- Other languages for which appropriate character sets are known are
- listed in the table below.
-
- Language Character set
-
- ar Arabic ISO-8859-6
- be Byelorussian ISO-8859-5
- bg Bulgarian ISO-8859-5
- el Greek ISO-8859-7
- en English USASCII
- fa Persian ISO-8859-6
- iw Hebrew ISO-8859-8
- ja Japanese ISO-IR-87 (Japanese JIS C6226-1983)
- ko Korean ISO-IR-149 (Korean KS C 5601-1989)
- la Latin USASCII
- lo Laotian ISO-IR-166
- ru Russian ISO-8859-5
- sw Swahili USASCII
- th Thai ISO-IR-166
- uk Ukrainian ISO-8859-5
- ur Urdu ISO-8859-6
- vo Volapuk ISO-8859-1
- zh Chinese ISO-IR-58 (Chinese GB 2312-80)
-
- Additional entries in this table are welcome!
-
- Some languages have only one or a few characters missing. These
- are listed below.
-
- Language          Character set  Missing
-
-    Sami           ISO-8859-9     C with caron
-                                  D with stroke
-                                  I with diaeresis
-                                  N with acute
-                                  Eng
-                                  S with caron
-                                  T with stroke
-                                  Z with caron
- kl Greenlandic    ISO-8859-1     I with tilde
-                                  K with cedilla
-                                  U with tilde
- cy Welsh          ISO-8859-1     W with acute
-                                  W with grave
-                                  W with diaeresis
-                                  Y with grave
-                                  Y with circumflex
- nl Dutch          ISO-8859-1     Ligature IJ
- af Afrikaans      ISO-8859-1     N preceded by apostrophe
- fr French         ISO-8859-1     Ligature OE
- ca Catalan        ISO-8859-1     L with middle dot
-
- According to comments received, the "problem characters" for
- Dutch, Afrikaans, French, Greenlandic and Catalan are not in
- common use, or may be avoided by use of alternate spelling (like
- using "ij" instead of the "Ligature IJ").
-
- For French, Dutch, Catalan and Afrikaans, the character set ISO
- 6937-2, which uses floating diacritical marks, contains all
- required characters.
-
- The following languages can (to the author's limited knowledge) be
- written with the current ISO 10646 standard, but with no other
- registered character sets:
-
-
- Language Country(ies) Script(s)
-
- aa Afar Somalia, Ethiopia, Djibouti Latin
- ab Abkhazian Georgia Cyrillic
- am Amharic Ethiopia Ethiopic
- as Assamese India, Nepal Bengali
- ay Aymara Bolivia, Peru, Chile Latin
- az Azerbaijani SNC, Iran, Iraq, Turkey Cyrillic, Arabic
- ba Bashkir SNC Cyrillic
- bh Bihari India Gujarati (or Kaithi)
- bi Bislama Vanuatu, New Caledonia Latin
- bn Bengali India Bengali
- co Corsican France Latin
- fj Fiji Fiji Latin
- gd Scots UK Latin
- gn Guarani Paraguay Latin
- gu Gujarati India Gujarati
- ha Hausa Nigeria, Niger, Chad, Sudan,... Latin
- hi Hindi India Devanagari
- hy Armenian Armenia Armenian
- ia Interlingua None (Artificial Language) Latin
- ie Interlingue None (Artificial Language) Latin
- ik Inupiak USA, Canada Latin, Cree
- in Indonesian Indonesia Latin
- ji Yiddish Germany, USA, SNC, Israel Hebrew
- jw Javanese Indonesia, Malaysia Latin, Javanese
- ka Georgian Georgia Georgian
- kk Kazakh SNC, Afghanistan Cyrillic, Arabic
- km Cambodian Cambodia Khmer
- kn Kannada India Kannada
- ks Kashmiri India, Pakistan Arabic
- ku Kurdish SNC, Turkey, Iraq, Iran Cyrillic, Arabic
- ky Kirghiz SNC, China, Afghanistan Cyrillic, Arabic
- ln Lingala CAR, Congo, Zaire Latin
- mg Malagasy Madagascar, Comoro Islands Latin, Arabic
- mi Maori New Zealand Latin
- mk Macedonian Greece, Yugoslavia Greek, Cyrillic
- ml Malayalam India Malayalam
- mn Mongolian Mongolia Cyrillic, Mongolian
- mo Moldavian Romania Latin
- mr Marathi India Devanagari
- ms Malay Malaysia, Thailand Latin
- my Burmese Myanmar Burmese
- na Nauru Nauru Latin
- ne Nepali Nepal Devanagari
- oc Occitan France Latin
- or Oriya India Oriya
- pa Punjabi India Gurmukhi
- ps Pashto (Western) Afghanistan, Iran Arabic
- qu Quechua Peru Latin
- rm Rhaeto Switzerland Latin
- rn Kirundi Burundi, Uganda Latin
- rw Kinyarwanda Rwanda, Uganda, Zaire Latin
- sa Sanskrit India Devanagari
- sd Sindhi Pakistan, India, Afghanistan Arabic, Gurmukhi
- sg Sangro Central African Republic Latin
- si Singhalese Sri Lanka Sinhalese
- sm Samoan Samoa, USA, New Zealand Latin
- sn Shona Zimbabwe, Zambia, Mozambique Latin
- so Somali Somalia, Ethiopia, Djibouti Latin
- sr Serbian former Yugoslavia Cyrillic
- ss Siswati S. Africa, Swaziland Latin
- st Sesotho S. Africa, Lesotho Latin
- su Sudanese Sudan Latin
- ta Tamil India, Malaysia Tamil
- te Telugu India Telugu
- tg Tajik Tajikistan Arabic
- ti Tigrinya Ethiopia Latin, Ethiopic
- tk Turkmen SNC, Iran, Afghanistan Cyrillic, Arabic
- tl Tagalog Philippines Latin
- tn Setswana S. Africa, Botswana, Namibia Latin
- to Tonga (3) Mozambique Latin
- ts Tsonga Mozambique, Swaziland Latin
- tt Tatar SNC Cyrillic
- tw Twi (Ewe) Ghana Latin
- uz Uzbek (Southern) Afghanistan, Turkey Arabic
- vi Vietnamese Vietnam, Cambodia, China Latin
- wo Wolof Senegal, Mauritania Latin
- xh Xhosa S. Africa Latin
- yo Yoruba Nigeria, Togo, Benin Latin
- zu Zulu S. Africa, Lesotho, Malawi Latin
-
-
- The information about languages in ISO 10646 was kindly supplied
- by Glenn Adams <glenn@metis.com>
-
- Languages for which the author does NOT know any proper character
- set include:
-
-
- bo Tibetan
- dz Bhutani
- et Estonian
- lt Lithuanian
- lv Latvian, Lettish
- mt Maltese
- sh Serbo-Croatian
-
-
-
- 6. REFERENCES
-
-
- [ISO 4873]
- <<title coming>> 1991 revision. Replaces ISO 2022
-
- [ISO 8859]
-
- [ISO 6937]
-
- [ISO 639]
-
- [X.209]
- CCITT Recommendation X.209(1988): Specification of Basic
- Encoding Rules for Abstract Syntax Notation One (ASN.1).
- Technically aligned with ISO 8825 and ISO 8825/AD 1.
-
- [ISO 10646]
-
- [RFC-2022-JP]
-
- [RFC-KELD]
-
-
- 7. Missing items
-
- This section is intended as a memory aid for the author, and
- should be empty by the time the RFC is published.
-
- (1) Get exact escape sequence information for ISO 10646
-
- (2) Full titles in the references section
-
- (3) Consider number of lines when listing extra chars in
- languages in cleartext
-
- (4) Check Sami character set with Sami school
-
- (5) Locate (Norwegian) editor of revision for ISO 639 and get
- language codes for Sorbian and Sami, if possible
-
- (6) Add MOTIS properly to reference list
-
- (7) Add Johan van Wingen's E-mail address
-
- (8) Number and reference entry for RFC-KELD
-
- (9) Check for references to/copies of Johan van Wingen's work